{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 The Google Research Authors.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "you may not use this file except in compliance with the License.\n",
    "You may obtain a copy of the License at\n",
    "\n",
    "     http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software\n",
    "distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "See the License for the specific language governing permissions and\n",
    "limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Domain Adaptation using DVRL\n",
    "\n",
    " * Jinsung Yoon, Sercan O Arik, Tomas Pfister, \"Data Valuation using Reinforcement Learning\", arXiv preprint arXiv:1909.11671 (2019) - https://arxiv.org/abs/1909.11671\n",
    " \n",
    "This notebook is a user guide for applying \"Data Valuation using Reinforcement Learning (DVRL)\" to domain adaptation.\n",
    "\n",
    "We consider the scenario where the training dataset comes from a distribution that differs substantially from that of the validation and testing sets. Data valuation is expected to help in this setting by selecting the training samples that best match the distribution of the validation dataset.\n",
    "\n",
    "You need:\n",
    "\n",
    "**Source / Target / Validation Datasets** \n",
    " * If there is no explicit validation set, you can use a small portion of the target set as the validation set and the remainder as the target set.\n",
    " * If you have your own source / target / validation datasets, save them as 'source.csv', 'target.csv', and 'valid.csv' in the './repo/data_files/' directory.\n",
    " \n",
    "## Requirements\n",
    "\n",
    "We use the Rossmann Store Sales dataset (https://www.kaggle.com/c/rossmann-store-sales) as the example in this notebook. Please download the dataset (rossmann-store-sales.zip) from https://www.kaggle.com/c/rossmann-store-sales/data and save it to the './repo/data_files/' directory after cloning the GitHub repository."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    " * Install the lightgbm package.\n",
    " * Clone https://github.com/google-research/google-research.git into the './repo' directory under the current working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uses pip3 to install the necessary packages (lightgbm, and GitPython for the clone step)\n",
    "!pip3 install lightgbm gitpython\n",
    "\n",
    "# Resets the IPython kernel to import the installed package.\n",
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from git import Repo\n",
    "\n",
    "# Directory (under the current working directory) to clone the repository into\n",
    "repo_dir = os.path.join(os.getcwd(), 'repo')\n",
    "\n",
    "if not os.path.exists(repo_dir):\n",
    "    os.makedirs(repo_dir)\n",
    "\n",
    "# Clones github repository\n",
    "if not os.listdir(repo_dir):\n",
    "    git_url = \"https://github.com/google-research/google-research.git\"\n",
    "    Repo.clone_from(git_url, repo_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import necessary packages and functions\n",
    "\n",
    " * load_rossmann_data: Data loader for the Rossmann dataset.\n",
    " * preprocess_data: Feature / label extraction and normalization.\n",
    " * dvrl: Data valuation module (provides the Dvrl class).\n",
    " * learn_with_dvrl, learn_with_baseline: Evaluation functions for the quality of data valuation in the domain adaptation setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "import lightgbm\n",
    "\n",
    "# Sets current directory\n",
    "os.chdir(repo_dir)\n",
    "\n",
    "from dvrl.data_loading import load_rossmann_data, preprocess_data\n",
    "from dvrl import dvrl\n",
    "from dvrl.dvrl_metrics import learn_with_dvrl, learn_with_baseline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data loading and selection of source, target, and validation datasets\n",
    "\n",
    " * Load the source, target, and validation datasets and save them as source.csv, target.csv, and valid.csv in the './repo/data_files/' directory.\n",
    " * If you have your own source.csv, target.csv, and valid.csv, you can skip this cell and just save those files to the './repo/data_files/' directory.\n",
    " \n",
    "**Input**: \n",
    " * dict_no: The number of source / validation samples. We use 79% / 1% / 20% as the source / validation / target ratio.\n",
    " * setting: One of 'train-on-all', 'train-on-rest', 'train-on-specific'.\n",
    " * target_store_type: Target store types ('A','B','C','D').\n",
    " \n",
    "For instance, to evaluate the performance on store type 'A': (1) the 'train-on-all' setting uses the entire source dataset, (2) the 'train-on-rest' setting uses only the source samples with store types 'B', 'C', and 'D', and (3) the 'train-on-specific' setting uses only the source samples with store type 'A'. Therefore, 'train-on-rest' has the largest distribution difference between the source and target datasets."
   ]
  },
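  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the three settings (not the actual logic of `load_rossmann_data`), the sketch below filters a toy list of source rows by store type. The helper name `select_source_rows` and the `store_type` key are hypothetical and exist only for this example.\n",
    "\n",
    "```python\n",
    "def select_source_rows(rows, setting, target_store_type):\n",
    "    # Returns the source rows used under the given setting (illustrative only).\n",
    "    if setting == 'train-on-all':\n",
    "        return list(rows)\n",
    "    if setting == 'train-on-rest':\n",
    "        return [r for r in rows if r['store_type'] != target_store_type]\n",
    "    if setting == 'train-on-specific':\n",
    "        return [r for r in rows if r['store_type'] == target_store_type]\n",
    "    raise ValueError('Unknown setting: ' + setting)\n",
    "\n",
    "\n",
    "# Toy rows; real data would come from the Rossmann files.\n",
    "toy_rows = [{'store_type': t, 'sales': s}\n",
    "            for t, s in [('A', 10), ('B', 20), ('C', 30), ('D', 40), ('A', 50)]]\n",
    "\n",
    "rest = select_source_rows(toy_rows, 'train-on-rest', 'A')\n",
    "specific = select_source_rows(toy_rows, 'train-on-specific', 'A')\n",
    "print(sorted({r['store_type'] for r in rest}))\n",
    "print(len(specific))\n",
    "```\n",
    "\n",
    "With target store type 'A', 'train-on-rest' keeps no rows from the target store type, which is why it is the hardest of the three settings."
   ]
  },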
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished data loading.\n"
     ]
    }
   ],
   "source": [
    "# The number of source / validation / target samples (79%/1%/20%)\n",
    "dict_no = dict()\n",
    "dict_no['source'] = 667027 # 79% of data\n",
    "dict_no['valid'] = 8443 # 1% of data\n",
    "\n",
    "# Selects a setting and target store type\n",
    "setting = 'train-on-rest'\n",
    "target_store_type = 'B'\n",
    "\n",
    "# Loads data and selects source, target, validation datasets\n",
    "load_rossmann_data(dict_no, setting, target_store_type)\n",
    "\n",
    "print('Finished data loading.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data preprocessing\n",
    "\n",
    " * Extract features and labels from source.csv, valid.csv, target.csv in './repo/data_files/' directory.\n",
    " * Normalize the features of source, validation, and target sets."
   ]
  },
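  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, the two normalization options can be sketched with NumPy as below. This is a minimal sketch, not the code of `preprocess_data`: the helper `normalize` is hypothetical, and computing the statistics on the source set only is an assumption made for the example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def normalize(x_fit, x_list, method='minmax'):\n",
    "    # Computes statistics on x_fit and applies them to every array in x_list.\n",
    "    if method == 'minmax':\n",
    "        offset = x_fit.min(axis=0)\n",
    "        scale = x_fit.max(axis=0) - offset\n",
    "    elif method == 'standard':\n",
    "        offset = x_fit.mean(axis=0)\n",
    "        scale = x_fit.std(axis=0)\n",
    "    else:\n",
    "        raise ValueError('Unknown method: ' + method)\n",
    "    # Avoids division by zero for constant features.\n",
    "    scale = np.where(scale == 0, 1.0, scale)\n",
    "    return [(x - offset) / scale for x in x_list]\n",
    "\n",
    "\n",
    "x_src = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])\n",
    "x_val = np.array([[2.5, 40.0]])\n",
    "x_src_n, x_val_n = normalize(x_src, [x_src, x_val], 'minmax')\n",
    "print(x_src_n)\n",
    "print(x_val_n)\n",
    "```\n",
    "\n",
    "Note that values outside the range seen in `x_fit` (such as the second validation feature here) map outside [0, 1] under min-max scaling."
   ]
  },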
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished data preprocess.\n"
     ]
    }
   ],
   "source": [
    "# Normalization methods: either 'minmax' or 'standard'\n",
    "normalization = 'minmax' \n",
    "\n",
    "# Extracts features and labels. Then, normalizes features.\n",
    "x_source, y_source, x_valid, y_valid, x_target, y_target, _ = \\\n",
    "preprocess_data(normalization, 'source.csv', 'valid.csv', 'target.csv')\n",
    "\n",
    "print('Finished data preprocess.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run DVRL\n",
    "\n",
    "1. **Input**: \n",
    "\n",
    " * parameters: Network parameters of the data valuator.\n",
    " * pred_model: The predictor model that maps inputs to outputs. Any machine learning model (e.g. a neural network or an ensemble of decision trees) can be used as the predictor model, as long as it provides fit and predict (for regression) or predict_proba (for classification) methods. Fit can be implemented using multiple backpropagation iterations.\n",
    "\n",
    " \n",
    "2. **Output**:\n",
    " * data_valuator: Function that takes the training set as input and estimates data values.\n",
    " * dvrl_predictor: Function that predicts the labels of the testing samples.\n",
    " * dve_out: Estimated data values for all training samples."
   ]
  },
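  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the interface requirement concrete, here is a deliberately trivial regressor exposing `fit` and `predict`. The class `WeightedMeanRegressor` is hypothetical, and accepting a `sample_weight` argument in `fit` follows the scikit-learn convention; how exactly DVRL passes weights to the predictor is not covered in this notebook, so treat that part as an assumption.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "class WeightedMeanRegressor:\n",
    "    # Toy predictor: always predicts the (weighted) mean of the training labels.\n",
    "\n",
    "    def fit(self, x, y, sample_weight=None):\n",
    "        self.mean_ = float(np.average(y, weights=sample_weight))\n",
    "        return self\n",
    "\n",
    "    def predict(self, x):\n",
    "        return np.full(len(x), self.mean_)\n",
    "\n",
    "\n",
    "model = WeightedMeanRegressor().fit(\n",
    "    np.zeros((3, 2)), np.array([1.0, 2.0, 6.0]),\n",
    "    sample_weight=np.array([1.0, 1.0, 0.0]))\n",
    "print(model.predict(np.zeros((2, 2))))\n",
    "```\n",
    "\n",
    "A real predictor such as `lightgbm.LGBMRegressor` exposes the same two methods, which is what makes it usable as `pred_model`."
   ]
  },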
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1000/1000 [14:26<00:00,  1.22it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Restoring parameters from ./tmp/model.ckpt\n",
      "Finished data valuation.\n"
     ]
    }
   ],
   "source": [
    "# Resets the graph\n",
    "tf.reset_default_graph()\n",
    "\n",
    "# Defines the problem\n",
    "problem = 'regression'\n",
    "\n",
    "# Network parameters\n",
    "parameters = dict()\n",
    "parameters['hidden_dim'] = 100\n",
    "parameters['comb_dim'] = 10\n",
    "parameters['iterations'] = 1000\n",
    "parameters['activation'] = tf.nn.tanh\n",
    "parameters['layer_number'] = 5\n",
    "parameters['batch_size'] = 50000\n",
    "parameters['learning_rate'] = 0.001\n",
    "\n",
    "# Defines predictive model\n",
    "pred_model = lightgbm.LGBMRegressor()\n",
    "\n",
    "# Sets checkpoint file name\n",
    "checkpoint_file_name = './tmp/model.ckpt'\n",
    "\n",
    "# Defines flag for using stochastic gradient descent / pre-trained model\n",
    "flags = {'sgd': False, 'pretrain': False}\n",
    "\n",
    "# Initializes DVRL\n",
    "dvrl_class = dvrl.Dvrl(x_source, y_source, x_valid, y_valid, problem, pred_model, parameters, checkpoint_file_name, flags)\n",
    "\n",
    "# Trains DVRL\n",
    "dvrl_class.train_dvrl('rmspe')\n",
    "\n",
    "# Estimates data values\n",
    "dve_out = dvrl_class.data_valuator(x_source, y_source)\n",
    "\n",
    "# Predicts with DVRL\n",
    "y_target_hat = dvrl_class.dvrl_predictor(x_target)\n",
    "\n",
    "print('Finished data valuation.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluations\n",
    "\n",
    " * In this notebook, we use LightGBM as the predictor model for evaluation purposes (but you can also replace it with another model).\n",
    " * Here, we use Root Mean Squared Percentage Error (RMSPE) as the performance metric."
   ]
  },
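  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, RMSPE can be computed as in the sketch below. The function `rmspe` here is written for illustration; skipping zero-valued targets is an assumption (the percentage error is undefined there), and the metric implemented in the repository may handle them differently.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def rmspe(y_true, y_pred):\n",
    "    y_true = np.asarray(y_true, dtype=float)\n",
    "    y_pred = np.asarray(y_pred, dtype=float)\n",
    "    # Skips zero targets, where the percentage error is undefined.\n",
    "    mask = y_true != 0\n",
    "    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]\n",
    "    return float(np.sqrt(np.mean(pct_err ** 2)))\n",
    "\n",
    "\n",
    "print(rmspe([100.0, 200.0, 0.0], [110.0, 180.0, 5.0]))\n",
    "```\n",
    "\n",
    "Lower is better: both predictions in the toy call are off by 10% of the true value, so the result is approximately 0.1."
   ]
  },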
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DVRL Performance\n",
    "\n",
    "DVRL learns robustly even when the training data distribution differs from the target data distribution. It does so by using guidance from the small validation set (which comes from the target distribution) via reinforcement learning.\n",
    " * Train the predictive model with weighted optimization, using the data values estimated by DVRL as the sample weights."
   ]
  },
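  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of weighted optimization can be sketched with a weighted least-squares fit in NumPy. This is not what `learn_with_dvrl` does internally (it trains the given evaluation model); the helper `weighted_linear_fit` and the hand-picked weights are hypothetical stand-ins for a predictor and for DVRL's estimated data values.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def weighted_linear_fit(x, y, weights):\n",
    "    # Minimizes sum_i w_i * (y_i - [x_i, 1] @ coef)^2 via rescaled least squares.\n",
    "    x_aug = np.hstack([x, np.ones((len(x), 1))])\n",
    "    sqrt_w = np.sqrt(weights)[:, None]\n",
    "    coef, *_ = np.linalg.lstsq(x_aug * sqrt_w, y * sqrt_w[:, 0], rcond=None)\n",
    "    return coef\n",
    "\n",
    "\n",
    "# Toy data: the first three points follow y = 2x; the last one is an outlier.\n",
    "x = np.array([[1.0], [2.0], [3.0], [4.0]])\n",
    "y = np.array([2.0, 4.0, 6.0, 20.0])\n",
    "\n",
    "uniform = weighted_linear_fit(x, y, np.ones(4))\n",
    "valued = weighted_linear_fit(x, y, np.array([1.0, 1.0, 1.0, 0.0]))\n",
    "print(uniform)\n",
    "print(valued)\n",
    "```\n",
    "\n",
    "With uniform weights the outlier pulls the slope well above 2, whereas giving it zero weight recovers slope 2 and intercept 0 (up to floating-point error). Down-weighting low-value samples plays the same role in the evaluation that follows."
   ]
  },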
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished evaluation.\n",
      "DVRL learning performance: 0.2955\n",
      "Baseline performance: 0.8471\n"
     ]
    }
   ],
   "source": [
    "# Defines evaluation model\n",
    "eval_model = lightgbm.LGBMRegressor()\n",
    "\n",
    "# DVRL-weighted learning\n",
    "dvrl_perf = learn_with_dvrl(dve_out, eval_model, \n",
    "                            x_source, y_source, x_valid, y_valid, x_target, y_target, 'rmspe')\n",
    "\n",
    "# Baseline prediction performance (treat all training samples equally)\n",
    "base_perf = learn_with_baseline(eval_model, x_source, y_source, x_target, y_target, 'rmspe')\n",
    "\n",
    "print('Finished evaluation.')\n",
    "print('DVRL learning performance: ' + str(np.round(dvrl_perf, 4)))\n",
    "print('Baseline performance: ' + str(np.round(base_perf, 4)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
